9 research outputs found

Encryption by using base-n systems with many characters

Author: Hoenen Armin
Publication date: 04/06/2023

It is possible to interpret text as numbers (and vice versa) if one interprets letters and other characters as digits and assumes that they have an inherent, immutable ordering. This is demonstrated by the conventional digit set of the hexadecimal number system, where the letters ABCDEF, in exactly this alphabetic sequence, each stand for a digit and thus a numerical value. In this article, we develop this idea consistently and include all symbols and the standard ordering of the Unicode standard for digital character coding. We show how this can be used to form digit sets of different sizes and how a subsequent simple conversion between bases can yield an encryption whose output mimics the results of wrong encoding and accidental noise. Unfortunately, because of encoding peculiarities, switching to a higher base does not automatically result in more efficient use of disk space. Comment: 12 pages, 6 figures
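The base-conversion idea in this abstract can be illustrated with a minimal Python sketch. It is not the article's implementation: the function names and the 1000-character digit set `BIG` (a contiguous run of Unicode code points starting at U+4E00) are assumptions chosen only for illustration.

```python
def text_to_int(text, digits):
    """Read `text` as a numeral; a character's value is its index in `digits`."""
    base = len(digits)
    value = 0
    for ch in text:
        value = value * base + digits.index(ch)
    return value


def int_to_text(value, digits):
    """Write a non-negative integer as a numeral over the digit set `digits`."""
    base = len(digits)
    if value == 0:
        return digits[0]
    out = []
    while value > 0:
        value, remainder = divmod(value, base)
        out.append(digits[remainder])
    return "".join(reversed(out))


def convert(text, src_digits, dst_digits):
    """Re-express a numeral written over `src_digits` in the base of `dst_digits`."""
    return int_to_text(text_to_int(text, src_digits), dst_digits)


# Conventional hexadecimal digit set: ABCDEF follow 0-9 in alphabetic order.
HEX = "0123456789ABCDEF"

# Hypothetical larger digit set (assumption): 1000 consecutive code points.
BIG = "".join(chr(cp) for cp in range(0x4E00, 0x4E00 + 1000))

scrambled = convert("CAFE", HEX, BIG)    # two characters in base 1000
restored = convert(scrambled, BIG, HEX)  # back to "CAFE"
```

Leading characters equal to the zero digit are lost in this sketch (for example "00FF" comes back as "FF"), so a practical scheme would need to handle them separately.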

arXiv.org e-Print Archive

Language classification from bilingual word embedding graphs

Author: Eger Steffen
Hoenen Armin
Mehler Alexander
Publication date: 10/10/2016

We study the role of the second language in bilingual word embeddings in monolingual semantic evaluation tasks. We find strongly and weakly positive correlations between downstream task performance and second language similarity to the target language. Additionally, we show how bilingual word embeddings can be employed for the task of semantic language classification and that joint semantic spaces vary in meaningful ways across second languages. Our results support the hypothesis that semantic language similarity is influenced by both structural similarity and geography/contact. Comment: To be published at Coling 201

arXiv.org e-Print Archive

TUbiblio

An open problem in computational stemmatology - a model for contamination

Author: Hoenen Armin
Publication date: 01/01/2019

In this contribution, two open problems in computational stemmatology are considered. The first one is contamination, an umbrella term referring to all phenomena of admixture of text variants that result from scribes consulting more than one manuscript, or even their memory, when copying a text. This problem is one of the biggest to date in stemmatology, since it implies an entirely different formal approach to the reconstruction of the copy history of a tradition and, in turn, to the reconstruction of an urtext. Maas (1937) famously stated that there is no remedy against contamination, and Pasquali and Pieraccioni (1952) coined the terms 'open' vs. 'closed' recensions to distinguish contaminated from uncontaminated traditions. We present a graph-theoretical model which formally accommodates traditions with any degree of contamination while maintaining a temporal ordering, and we give combinatorial numbers and formulas on the implications for the number of possible scenarios

AlmaDL Journals

Hochschulschriftenserver - Universität Frankfurt am Main

Tools, evaluation and preprocessing for stemmatology

Author: Hoenen Armin (Magister Artium)
Publication date: 09/03/2018

This thesis deals with stemmatology, i.e. primarily the reconstruction of the copy history of documents transmitted in handwriting. The central object of stemmatology is the stemma, a visual representation of the copy history, which in graph-theoretical terms usually takes the form of a tree or a directed acyclic graph, in which the nodes represent witnesses (i.e. the text variants) and the edges stand for individual copying processes. At the centre of the discipline lie the question of the author's original (if a single one ever existed) and the question of reconstructing its text. The stemma itself is a means to this main end (Cameron 1987). The original text, increasingly altered by the deviations characteristic of manual copying, has usually not been transmitted directly. The aim of the thesis is to describe semi-automatic stemmatology comprehensively and to develop it further through tools and analytical methods. The first part describes the history of computer-assisted stemmatology, including its classical precursors, and culminates in the presentation of a simple tool for the dynamic graphical display of stemmata. An excursus on the guiding philological principle of lectio difficilior discusses its possible psycholinguistic causes in the faster lexical access to high-frequency lexemes. In the second part, the most existential of all stemmatological debates, initiated by Joseph Bédier, is examined with mathematical arguments based on a stemmatic model proposed by Paul Maas in 1937. In the same chapter, the author also simulates stemmata in order to estimate the potential influence of the distribution of copy frequencies per manuscript.
In the next part, the author presents a specially compiled corpus in Persian, which is examined qualitatively alongside 3 of the known artificial corpora (Parzival, Notre Besoin, Heinrichi). The Multi Modal Distance is then applied as a method of stemma generation that relies on external data on psycholinguistically determined letter confusion probabilities. In the last part, the author works with minimum spanning trees for stemma generation, carrying out, evaluating and discussing a comparative study that combines 4 methods of distance matrix generation with 4 methods of stemma generation
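The general idea of deriving a tree from a distance matrix via a minimum spanning tree can be sketched in Python as follows. This is only an illustration under stated assumptions, not one of the methods compared in the thesis: plain Levenshtein distance stands in for the distance measure, Prim's algorithm builds the tree, and the witness names and texts are invented.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def distance_matrix(witnesses):
    """Pairwise distances between witness texts, in sorted name order."""
    names = sorted(witnesses)
    dist = [[levenshtein(witnesses[x], witnesses[y]) for y in names]
            for x in names]
    return names, dist


def minimum_spanning_tree(names, dist):
    """Prim's algorithm; returns undirected edges as (name_u, name_v, weight)."""
    in_tree = {0}
    edges = []
    while len(in_tree) < len(names):
        weight, i, j = min((dist[i][j], i, j)
                           for i in in_tree
                           for j in range(len(names))
                           if j not in in_tree)
        in_tree.add(j)
        edges.append((names[i], names[j], weight))
    return edges


# Invented toy witnesses (assumption), differing by single-letter changes.
witnesses = {
    "A": "the quick brown fox",
    "B": "the quick brown fix",
    "C": "the quick brawn fix",
}
names, dist = distance_matrix(witnesses)
tree = minimum_spanning_tree(names, dist)
```

The result is an undirected, unrooted tree over the surviving witnesses; turning it into a stemma would additionally require choosing a root and allowing for lost intermediate copies, which this sketch does not attempt.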

Hochschulschriftenserver - Universität Frankfurt am Main

A Manual for Web Corpus Crawling of Low Resource Languages

Author: Hoenen Armin
Koc Cemre
Rahn Marc Daniel
Publication venue: AIUCD ; FICLIT
Publication date: 18/05/2020

Since the seminal publication of “Web as Corpus” [1], the potential of the web as a source has been firmly established for the creation of both online and offline corpora: noisy vs. clean, balanced vs. convenient, annotated vs. raw, small vs. big are only some of the antonyms that can be used to describe the range of corpora that can be and have been created. In our case, in the wake of the project Under Resourced Language Content Finder (URLCoFi), we describe a systematic approach to the compilation of corpora for low (or under) resource(d) languages (LRL) from the web, in connection with a free eLearning course funded by studiumdigitale at Goethe University, Frankfurt. Despite the ease of retrieving documents from the web, some characteristics of the digital medium introduce certain difficulties. For instance, if someone were to collect all documents on the web in a certain language, firstly, the collection could only be a snapshot, since the web constantly changes its content, and secondly, there would be no way to ascertain completeness. In this paper, we show ways to deal with such difficulties in search scenarios for LRLs, presenting experiences from a course on this topic. [1] A. Kilgarriff and G. Grefenstette, “Web as corpus,” in Proceedings of Corpus Linguistics 2001, 2001, pp. 342–344

AlmaDL Journals

A practitioner’s view: a survey and comparison of lemmatization and morphological tagging in German and Latin

Author: Eger Steffen
Gleim Rüdiger
Hemati Wahed
Henlein Alexander
Hoenen Armin
Kahlsdorf Sven
Lücking Andy
Mehler Alexander
Uslu Tolga
Publication venue: Institute of Computer Science, Polish Academy of Sciences
Publication date: 01/01/2019

The challenge of POS tagging and lemmatization in morphologically rich languages is examined by comparing German and Latin. We start by defining an NLP evaluation roadmap to model the combination of tools and resources guiding our experiments. We focus on what a practitioner can expect when using state-of-the-art solutions. These solutions are then compared with old(er) methods and implementations for coarse-grained POS tagging, as well as fine-grained (morphological) POS tagging (e.g. case, number, mood). We examine to what degree recent advances in tagger development have improved accuracy – and at what cost, in terms of training and processing time. We also conduct in-domain vs. out-of-domain evaluation. Out-of-domain evaluation is particularly pertinent because the distribution of data to be tagged will typically differ from the distribution of data used to train the tagger. Pipeline tagging is then compared with a tagging approach that acknowledges dependencies between inflectional categories. Finally, we evaluate three lemmatization techniques

TUbiblio

Biblioteka Nauki - repozytorium artykułów

Handbook of Stemmatology: History, Methodology, Digital Approaches

Author: Andrews Tara
Buzzoni Marina
Conti Aidan
Goransson Elisabet
Haugen Odd Einar
Hoenen Armin
Macè Caroline
Roelli Philipp
van Zundert Joris.
Publication venue: Berlin
Publication date: 01/01/2020

Stemmatology studies aspects of textual criticism that use genealogical methods to analyse a set of copies of a text whose autograph is lost. As an art (ars), stemmatology has its main goal in editing, and thus presenting to the reader, such a text in the most satisfactory way; as a more abstract discipline (scientia), it is interested in the general principles of how texts change in the process of being copied. This handbook provides the first coverage of the entire field: theoretical and practical aspects of traditional and modern digital methods. Thirty-eight experts from all fields involved joined forces to write the book, which covers, in forty-one sections, topics ranging from material aspects of text traditions, through methods of traditional textual criticism, to modern digital methods used in the field. The two final chapters, respectively, provide closer views of how the approach towards texts and textual criticism has developed in some well-defined disciplines of textual scholarship, and compare methods used in other fields dealing with "descent with modification". Illustrations with many practical examples from a wide range of disciplines are provided to render the content more accessible. The intended readership comprises both students of various fields involved with texts and more advanced scholars

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Handbook of Stemmatology: History, Methodology, Digital Approaches

Author: Andrews Tara
Buzzoni Marina
Conti Aidan
Goransson Elisabet
Haugen Odd Einar
Hoenen Armin
Macè Caroline
Roelli Philipp
van Zundert Joris.
Publication venue: Walter de Gruyter